Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Batteries included python package #310

Merged
merged 20 commits into from
Jan 26, 2022
Merged

Batteries included python package #310

merged 20 commits into from
Jan 26, 2022

Conversation

jerinphilip
Copy link
Contributor

@jerinphilip jerinphilip commented Jan 24, 2022

This PR imports python bindings from https://github.com/jerinphilip/lemonade along with a bit of python code that makes this repository a little easier to start with.

The additions bring about the following that is of value:

  • Ability to fetch and prepare models from model-repositories (translateLocally’s browsermt models, opus models as they share browsermt’s JSON).
  • Python integration boosts experimental capabilities and hopefully brings more community attention.

Summary of Changes

  • bindings/python folder containing python bindings and additional python code for batteries included package.
  • 3rd_party/pybind11, new dependency.
  • doc/conf.py and doc/requirement.txt changes for packages (argparse docs).
  • doc/python.rst, wired to doc/index.rst to enable python documentation.
  • CMakeLists.txt changes to allow python to integrate into the existing source.
  • .github/workflows/python.yml imports workflows for building wheels on python.
  • .github/workflows/doc.yml job is moved into .github/workflows/python.yml to be ran after the wheel is built to tap into python documentation.

Python checks

All python code in this repository (including doc/conf.py) is made to adhere to the following:

  1. black formatting.
  2. isort with profile black.

In addition, since we lack serious tests a static type checker from google (pytype) is used to fish out stuff. Addition of a proper testing framework will only happen after this PR if at all. For now, command line is tested through GitHub CI for no error runs which should cover a decent chunk of code.

Where all does it work?

Currently supports Linux and at least one Mac (although adding more should be easier).

One build targets Colab (Could be useful for #207). MKL etc appears to be linked statically thus making it easy to distribute. The pre-built wheels have been tested to be smoothly running on:

  • Ubuntu 18.04 py3.6
  • Ubuntu 18.04 py3.7 (colab)
  • Ubuntu 20.04 py3.8
  • ArchLinux with py3.10 using the wheel built via Ubuntu.

@jerinphilip has had GLIBC version issues when trying Ubuntu 18.04 py3.8 with Ubuntu 20.04 built wheels. The general solution seems to be to go for a manylinux wheel built on a CentOS image. (Related: ugermann/ssplit-cpp#30). @jerinphilip is also squatting on pypi package name, for until the (manylinux) wheel is ready.

Fixes #234.

@jerinphilip jerinphilip changed the title Import python binding sources Import python sources Jan 24, 2022
@jerinphilip jerinphilip changed the title Import python sources Batteries included python package frontend Jan 24, 2022
@jerinphilip jerinphilip changed the title Batteries included python package frontend Batteries included python package Jan 24, 2022
* Remove doc for moving into python
* Include python.rst in toc
* Use 18.04 cp3.7 wheel
* Integrating python documentation
@jerinphilip jerinphilip marked this pull request as ready for review January 24, 2022 22:10
@jerinphilip
Copy link
Contributor Author

jerinphilip commented Jan 24, 2022

Default is a merge by next Monday in whatever state the code is. Should anyone have the time and courage to - please do look at the 1500 line diff and leave comments. I expect to address critical stuff. Non-critical comments will still be addressed, hopefully as smaller changes behind this PR. There are known rough edges, but what is here is a minimum-viable product.

@jelmervdl finds the prospect of using python HTML experimentation and evaluations useful. Life would be easier with 2 repositories than the 3 (bergamot-translator, lemonade, tagtransfer) we're working with.

@kpu has given a no-objection for python bindings sitting somewhere not slowing down the library source and has also kindly observed that this also happens to be python bindings for "marian", which bergamot is built on top of.

macOS buildability and python target only builds are courtesy @jelmervdl. A lot of workflow code and model zoo stuff is thanks to XapaJIaMnu/translateLocally - @jelmervdl and @XapaJIaMnu.

Copy link
Contributor Author

@jerinphilip jerinphilip left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some notes to self... and call for help.

"browsermt", "https://translatelocally.com/models.json"
),
TranslateLocallyLike(
"opus", "https://object.pouta.csc.fi/OPUS-MT-models/app/models.json"
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Attribution...? Help on howto?

doc/conf.py Outdated Show resolved Hide resolved
setup.py Show resolved Hide resolved
setup.py Outdated Show resolved Hide resolved
setup.py Show resolved Hide resolved
3rd_party/CMakeLists.txt Show resolved Hide resolved
XapaJIaMnu
XapaJIaMnu previously approved these changes Jan 25, 2022
Copy link
Collaborator

@XapaJIaMnu XapaJIaMnu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Other than the bergamot.cpp file, I approve.

bindings/python/__init__.py Outdated Show resolved Hide resolved
bindings/python/bergamot.cpp Show resolved Hide resolved
bindings/python/bergamot.cpp Outdated Show resolved Hide resolved
inventory = requests.get(self.url).text
with open(self.models_file_path, "w+") as models_file:
models_file.write(inventory)
self.data = json.loads(inventory)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have updated the models.json and model_info.json to contain repository information. Eg: https://translatelocally.com/models.json
https://github.com/browsermt/students/blob/master/enfa/enfa.student.tiny11/model_info.json#L7

doc/conf.py Outdated Show resolved Hide resolved
setup.py Show resolved Hide resolved
setup.py Outdated Show resolved Hide resolved
MANIFEST.in Outdated Show resolved Hide resolved
bindings/python/__init__.py Outdated Show resolved Hide resolved
bindings/python/bergamot.cpp Outdated Show resolved Hide resolved
bindings/python/cmds.py Show resolved Hide resolved
Comment on lines 218 to 219
py::arg("numWorkers") = 1, py::arg("cacheEnabled") = false,
py::arg("cacheSize") = 20000, py::arg("cacheMutexBuckets") = 1,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does cacheSize = 20000 but cacheEnabled = false entail that it does allocate space for 20k entries, but never use it by default? Is cacheMutexBuckets=1 a good default if someone were to change the numWorkers?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, w.r.t cacheMutexBuckets, I want to stop exposing them to the user, same as what translateLocally is doing. As for the cacheEnabled what do you think of wrapping cache in std::optional? It is also possible to set a per request enabled/disabled as well even if we've created the cache-store.

In any case, choosing to defer this for later.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, w.r.t cacheMutexBuckets, I want to stop exposing them to the user, same as what translateLocally is doing.

I think this is a good idea. Like, there has to be a theoretical optimum value with respect to the number of threads and the cacheSize. bergamot-translator can figure out that relation. That shouldn't be up to the user.

As for the cacheEnabled what do you think of wrapping cache in std::optional? It is also possible to set a per request enabled/disabled as well even if we've created the cache-store.

My problem is that you allocate a 20.000 ptr vector by default, even when cacheEnabled is false. Easiest would be to just add a ternary operator here to set cacheSize to 0 and save a dynamic allocation for it when cacheEnabled is off.

Other options would be making cache a Ptr (so you can manage it on your own, like clearing it when switching models in translateLocally since you know that old entries wont be very relevant at that point, or keep it between AsyncService changes when changing number of workers or something) or keep it like it is now, but having cacheSize=0 shortcut all calls into it or something. Or indeed wrap it in a std::optional to save it from being initialised when cacheEnabled=0.

bindings/python/bergamot.cpp Outdated Show resolved Hide resolved
@jelmervdl
Copy link
Member

jelmervdl commented Jan 25, 2022

Tested it locally, and after applying the macOS patch it compiles and installs like a charm!

One thing I did notice as an annoyance while working in a jupyter notebook with it: You cannot recreate service due the logger issue (can't find the ticket for it anymore) where there can only be one instance of Logger at a time. This is annoying when re-executing a notebook cell. It works if you explicitly delete (i.e. del service) it before you initialize it.

edit: found it again! #290

@jerinphilip jerinphilip merged commit c0f311a into main Jan 26, 2022
@jerinphilip jerinphilip deleted the python branch January 26, 2022 20:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Python bindings and a module
3 participants